Abstract

By using concrete scenarios, we present and discuss a new concept of probabilistic
Self-Stabilization in Distributed Systems.


1 Introduction 

A distributed system is a network of agents that coordinate their actions by exchanging
messages [M]. In order to be effective, distributed systems have to be self-stabilizing [TB]: A system
is self-stabilizing if, starting from an arbitrary state, it will quickly reach a legitimate state and,
once in a legitimate state, it will only take on legitimate states thereafter. Self-stabilization
has important consequences: for example, it allows (fast) recovery when the system is prone
to transient faults that might take it into non-legitimate states OIISIEI]. Self-stabilization is
a mature subject in the area of Distributed Computing m and self-stabilizing algorithms for
classical computational tasks are nowadays well understood.

The original concept of self-stabilization is too restrictive to properly describe some modern 
systems, e.g., P2P and social networks, which are dynamic. Indeed, several relaxations have 
been proposed so far: probabilistic self-stabilization [5D], where randomized strategies for
self-stabilization are allowed; pseudo self-stabilization, where the system is allowed to deviate
from legitimate states for a finite amount of time; k-self-stabilization [3], where restrictions on
the initial state are imposed (namely, the allowed initial states are those from which a legitimate
state of the system can be reached by changing the state of at most k agents); weak
self-stabilization IE], which only requires the existence of an execution that eventually converges to
a legitimate state. However,

all the above relaxations fail to capture the notion of a system that 
is self-stabilizing only with high probability and that is required 
to remain in legitimate states only over a sufficiently long time 
interval. 

The main goal of this work is to discuss a new probabilistic notion of self-stabilization that 
is general enough to apply to a wide class of complex distributed systems and suitable to derive 
algorithmic principles that induce effective and useful self-stabilizing behavior in such systems. 
In general, we say a system is self-stabilizing, according to this revised notion, if 
a. From any state it will (quickly) converge to a legitimate state with high probability (w.h.p.),

and 

b. Once in a legitimate state, w.h.p. it will only take on legitimate states over a sufficiently
long time span (for instance, over an arbitrarily large polynomial number of steps).

In the next section, we illustrate this new concept in reference to a specific and important 
scenario, in which probabilistic self-stabilization turned out to be an effective tool of analysis 
on one hand and naturally led to challenging open questions on the other. 

2 Self-Stabilizing Repeated Balls-into-Bins 

We consider the following repeated balls-into-bins process over an undirected graph G(V, E)
with |V| = n nodes. Initially, n balls are assigned to the n nodes in an arbitrary way. Then, in
every round, one ball is chosen from each non-empty node according to some strategy (random,
FIFO, etc.) and re-assigned to one of the node’s neighbours uniformly at random. Thus, at
every time t each node has a (possibly empty) queue of balls waiting to be forwarded, while 
every ball performs a sort of delayed random walk over the graph, the delay of each random 
walk depending on the sizes of the queues it encounters along its path. It thus follows that 
these random walks are correlated. The main issue here is to investigate the impact of such 
correlation on the maximum load. 
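
The process on the complete graph can be simulated in a few lines. The following Python sketch is only illustrative: the function names (step, max_load_over_time), the starting configuration with all balls in one bin, and the random seed are assumptions made for this example. Balls are treated as indistinguishable, so only queue sizes are tracked.

```python
import random


def step(loads, rng):
    """Run one synchronous round on the complete graph, in place.

    Every non-empty bin forwards exactly one ball to a neighbour chosen
    uniformly at random (any bin other than itself).
    """
    n = len(loads)
    # Senders are fixed at the start of the round: a ball received during
    # this round cannot be forwarded before the next one.
    senders = [u for u in range(n) if loads[u] > 0]
    for u in senders:
        loads[u] -= 1
    for u in senders:
        v = rng.randrange(n - 1)  # uniform over the n - 1 other bins
        if v >= u:
            v += 1                # skip u itself
        loads[v] += 1


def max_load_over_time(n, rounds, seed=0):
    """Return the maximum load after each round, starting from the
    (arbitrary) configuration with all n balls in bin 0."""
    rng = random.Random(seed)
    loads = [n] + [0] * (n - 1)
    history = []
    for _ in range(rounds):
        step(loads, rng)
        history.append(max(loads))
    return history
```

For instance, max_load_over_time(n=1000, rounds=5000) returns the trajectory of the maximum load of a single run, which can then be plotted or compared against a multiple of log n.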

Inspired by previous concepts of (load) stability [UHI], we study the maximum load,
i.e., the maximum queue size at round t, and we are interested in the largest maximum load achieved by
the process over a period of (any) polynomial length. In the rest of this section we assume G is
the complete graph.

Applying the notion of probabilistic self-stabilization. We next discuss an approach
that relies on the notion of probabilistic self-stabilization to bound the maximum load, also
resulting in tighter bounds on the parallel cover time on the complete graph. In this approach,
the state of the process at any time t is completely specified by its configuration, i.e., the
queue size of each node at time t. Our notion of probabilistic self-stabilization discussed above
lends itself to a natural characterization of this process, for which it specializes as follows:

Definition 1 (Self-Stabilizing Repeated Balls into Bins.) 

• A configuration is legitimate if its maximum load is O(log n) and a process is stable if,
starting from any legitimate configuration, it only takes on legitimate configurations over 
a period of poly(n) length, w.h.p. 

• A process is self-stabilizing if it is stable and if, moreover, starting from any configuration, 
it reaches a legitimate configuration, w.h.p. 

• The convergence time of a self-stabilizing process is the maximum number of rounds
required to reach a legitimate configuration starting from any configuration.
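
Definition 1 can be made concrete with an explicit threshold. The Python sketch below is a minimal illustration under stated assumptions: the constant c = 4 in the threshold c · ln n is an arbitrary choice (the definition only asks for O(log n)), the function names are hypothetical, and rounds_until_legitimate measures a single random run from one given configuration, not the maximum over all configurations that defines the convergence time.

```python
import math
import random


def is_legitimate(loads, c=4.0):
    """Check the legitimacy condition with an explicit threshold.

    The constant c is an assumption: Definition 1 only asks for a maximum
    load of O(log n) and does not fix a constant.
    """
    return max(loads) <= c * math.log(len(loads))


def step(loads, rng):
    """One round on the complete graph: each non-empty bin forwards one
    ball to a uniformly random other bin."""
    n = len(loads)
    senders = [u for u in range(n) if loads[u] > 0]
    for u in senders:
        loads[u] -= 1
    for u in senders:
        v = rng.randrange(n - 1)
        loads[v + 1 if v >= u else v] += 1


def rounds_until_legitimate(loads, c=4.0, seed=0, max_rounds=10**6):
    """Number of rounds a single run needs to reach a legitimate
    configuration; returns None if max_rounds is exceeded."""
    rng = random.Random(seed)
    loads = list(loads)  # do not modify the caller's configuration
    for t in range(max_rounds + 1):
        if is_legitimate(loads, c):
            return t
        step(loads, rng)
    return None
```

Calling rounds_until_legitimate([n] + [0] * (n - 1)) for increasing n gives an empirical feel for how the time to reach a legitimate configuration grows from the configuration with all balls in one bin.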

It is important to observe that, unlike previous concepts of self-stabilization, here there is always
a small chance that the system leaves legitimate states even if no “external” events (e.g., faults)
happen. This natural notion of (probabilistic) self-stabilization was also inspired by the
one proposed in m for other distributed processes. On the other hand, stability impacts other
important aspects of this process. For instance, if the process is stable, we can prove good lower
bounds on the progress of a ball, namely, the number of rounds in which the ball is selected
from its current queue and forwarded, over a sequence of t ≥ 1 rounds. In turn, this provides
a useful tool to bound the parallel cover time, i.e., the time required for every ball to visit all
nodes (further details are given later in this section).

Footnote: Note that, at least to characterize the maximum load, we can assume that balls are indistinguishable.

Repeated-balls-into-bins: past work. To the best of our knowledge, the repeated
balls-into-bins process was first studied in m, where it is used as a crucial sub-procedure to optimize
the message complexity of a gossip algorithm over the complete graph. The previous analysis
in naiis] (only) holds over very short (i.e., logarithmic) periods. On the other hand, the analysis
in [7] considers periods of arbitrary length, but it (only) yields a bound on the maximum load
that rapidly increases with time: after t rounds, the maximum load is O(√t) w.h.p. By adopting
the FIFO strategy at every bin queue, the latter result easily implies that the progress of any
ball over t consecutive rounds is Ω(√t) w.h.p. Moreover, it is well known that the cover time
for the single-ball process is w.h.p. O(n log n) (it is in fact equivalent to the coupon collector's
process [23]). These two facts easily imply an upper bound O(n² log² n) for the parallel cover
time of the repeated balls-into-bins process on the complete graph.
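
The arithmetic behind this last step can be sketched as follows. Here a, b > 0 are hypothetical names for the constants hidden in the asymptotic notation: a ball needs at most b n log n forwarding steps to visit all nodes, and it is forwarded at least a√t times over t rounds (both w.h.p., with a union bound over the n balls).

```latex
\begin{align*}
\text{progress over } t \text{ rounds} &\ge a\sqrt{t}, \\
\text{forwarding steps needed to cover} &\le b\, n \log n, \\
a\sqrt{t} \ge b\, n \log n
  &\iff t \ge \left(\tfrac{b}{a}\right)^{2} n^{2} \log^{2} n
  \quad\Longrightarrow\quad t = O(n^{2}\log^{2} n).
\end{align*}
```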

In this respect, previous analyses of the maximum load in [Tl llUllTH] are far from tight, since 
they rely on rough approximations of the process via other, much simpler Markov chains: for 
instance, in [7], the authors consider the process - which obviously dominates the original one 
- where, at the beginning of every round, a new ball is added to every empty bin. Clearly, this 
approach does not exploit a key global invariant (the fixed number n of balls) of the original 
process. Previous results are thus not helpful to establish whether this process is stable (or, 
even more, self-stabilizing). 

In [6], our group proposed a new, tight analysis of the repeated balls-into-bins process that
significantly departs from previous ones, showing that the system is self-stabilizing in the sense
of Definition 1. These results are summarized in the following theorem.

Theorem 2 ([6]) Let c be an arbitrarily large constant, and let the process start from any
legitimate configuration. The maximum load is O(log n) for all t = O(n^c), w.h.p. Moreover,
starting from any configuration, the system reaches a legitimate configuration within O(n)
rounds, w.h.p.

The above result strongly improves over the best previous bounds nmniT^ and it is almost tight
(since we know that the maximum load is Ω(log n / log log n) at least during the first rounds [25]).
Moreover, under the FIFO forwarding strategy, the progress of any ball over a sequence of
t = poly(n) consecutive rounds is Ω(t / log n) w.h.p. Consequently, the parallel cover time on
the complete graph is O(n log² n) w.h.p., which is only a log n factor away from the lower bound
following from the single-ball process.
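
The step from the progress bound to the cover time can be spelled out as a short calculation. As before, a, b > 0 are hypothetical names for the hidden constants: a ball is forwarded at least a t / log n times over t rounds, and at most b n log n forwarding steps suffice for it to visit all nodes.

```latex
\begin{align*}
\frac{a\, t}{\log n} \ge b\, n \log n
  &\iff t \ge \frac{b}{a}\, n \log^{2} n
  \quad\Longrightarrow\quad t = O(n \log^{2} n).
\end{align*}
```

The resulting t is polynomial in n, so it falls within the range t = poly(n) over which the progress bound holds.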

Wrapping up: balls-into-bins, distributed computing and self-stabilization. As mentioned
above, besides being interesting per se, balls-into-bins processes are used to model and
analyze several important randomized protocols in parallel and distributed computing |ll[9l[27| .
In particular, the process we study models a natural randomized solution to the problem of
(parallel) resource (or task) assignment in distributed systems (this problem is also known as
traversal) [231I3H] . For a more detailed discussion of this potential application, the reader may
refer to [6].

Theorem 2 also allows us to study the adversarial model in which, on some faulty rounds, an
adversary can re-assign balls to the bins in an arbitrary way. The self-stabilization property
and the linear convergence time shown in Theorem 2 in fact ensure that the O(n log² n) upper
bound on the cover time still holds, provided faulty rounds occur with a frequency no higher
than cn, for a sufficiently small constant c.

2.1 Open Questions 

Below, we briefly discuss two challenging extensions to the basic problem discussed above, that 
appear natural in the probabilistic self-stabilization setting we defined. 

More balls. Consider the Repeated Balls-into-Bins process with m > n balls and n bins
and define a state to be legitimate if the maximum load is O(m/n · log n) (or, even better,
O(m/n + log n)). Then, we can pose the same questions: Is the system self-stabilizing? Can we
get a good upper bound on the convergence time? Experiments performed on systems
with parameters m ~ n log n (and increasing values of n up to 1 million) seem to suggest that
this might be the case. Also, and quite surprisingly, starting from any initial state, the system
seems to quickly reach and remain confined to configurations that result in a constant fraction
of empty bins. On the one hand, if proved, this property would drastically distinguish the
behaviour of this process from the memoryless one in which, in each round, all m balls are
randomly assigned to the n bins independently of their current positions (i.e., there are no bin
queues). Indeed, in the latter process, a well-known result implies that, w.h.p., all bins will be
non-empty over any polynomial number of rounds. On the other hand, convergence towards
a constant fraction of empty bins is a key ingredient for proving self-stabilization in the case
m = n (i.e., Theorem 2). Thus, we believe that, at least when m ^ nlogn, the system is still
self-stabilizing.
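
A small experiment of this kind can be sketched in Python. This is not the authors' experimental setup: the function name, the starting configuration (all balls in one bin), the number of rounds and the seed are all assumptions chosen for illustration. The sketch reports the fraction of empty bins and the maximum load at the end of a single run.

```python
import math
import random


def empty_fraction_after(n, m, rounds, seed=0):
    """Run the repeated process with m balls and n bins on the complete
    graph and return (fraction of empty bins, maximum load) at the end."""
    rng = random.Random(seed)
    loads = [m] + [0] * (n - 1)   # arbitrary start: all balls in bin 0
    for _ in range(rounds):
        # Each bin that is non-empty at the start of the round sends one ball.
        senders = [u for u in range(n) if loads[u] > 0]
        for u in senders:
            loads[u] -= 1
        for u in senders:
            v = rng.randrange(n - 1)          # uniform over the other bins
            loads[v + 1 if v >= u else v] += 1
    return loads.count(0) / n, max(loads)
```

For example, with n = 500 and m = round(n * math.log(n)), one can call empty_fraction_after(n, m, rounds=20 * m) for several seeds and compare the returned maximum load with m/n + log n.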

General Graphs. Analyzing the (probabilistic) self-stabilization properties of the repeated
balls-into-bins process turned out to be extremely challenging in networks other than the clique.
This “hardness” seems to hold even for restricted classes of graphs. For instance, we tried to
extend our analysis to the most symmetric, sparse case: the ring. Yet, while intuition and
simulation results suggest that the process is self-stabilizing in the sense defined above, we have
not been able to get any rigorous analytical results so far.
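
The ring case is easy to explore numerically. The Python sketch below generalizes the round to an arbitrary graph given by adjacency lists; the function names, the starting configuration and the seed are assumptions for illustration, and the output of a run is only empirical evidence, not a proof of any bound.

```python
import random


def ring(n):
    """Adjacency lists of the n-node ring (cycle), for n >= 3."""
    return [[(u - 1) % n, (u + 1) % n] for u in range(n)]


def step(loads, adj, rng):
    """One round on an undirected graph: every non-empty node forwards one
    ball to a neighbour chosen uniformly at random."""
    senders = [u for u in range(len(loads)) if loads[u] > 0]
    for u in senders:
        loads[u] -= 1
    for u in senders:
        loads[rng.choice(adj[u])] += 1


def max_load_history(adj, rounds, seed=0):
    """Maximum load after each round, starting with all n balls on node 0."""
    rng = random.Random(seed)
    n = len(adj)
    loads = [n] + [0] * (n - 1)
    history = []
    for _ in range(rounds):
        step(loads, adj, rng)
        history.append(max(loads))
    return history
```

Running max_load_history(ring(n), rounds) for growing n and comparing the tail of the history with log n is one way to look for the behaviour that the simulations mentioned above suggest.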

3 Further research directions 

The notion of probabilistic self-stabilization applies to other fundamental problems in
Distributed Computing. One of the most interesting is the classic problem of reaching (w.h.p.)
a stabilizing consensus (or even more a stabilizing majority consensus laEiiiiiiaiEi) in a
distributed system consisting of a finite set of agents. In a basic variant, each of the agents initially
supports an opinion (say a value chosen from a fixed set K of legal values). The goal here is to have
the system converge to a state in which (w.h.p.) all nodes share the same opinion and this opinion was
present in the initial configuration. Moreover, solutions are required to be fault-tolerant (i.e.,
stable) w.r.t. some bounded adversary that can change the values of a subset of the nodes in
each round of the consensus process. Important advances in probabilistic versions of stabilizing
consensus were recently made [21II1IS1IIS]. However, available concepts of stabilizing consensus
do not fully reflect our proposed notion of probabilistic self-stabilization. Our weaker version of
consensus may lead to more efficient and more robust protocols that still work in practice for
most applications.